RegExMatch() [v1.0.45+]


Determines whether a string contains a pattern (regular expression).

FoundPos := RegExMatch(Haystack, NeedleRegEx [, UnquotedOutputVar = "", StartingPos = 1])

Parameters

FoundPos RegExMatch() returns the position of the leftmost occurrence of NeedleRegEx in the string Haystack. Position 1 is the first character. Zero is returned if the pattern is not found. If an error occurs (such as a syntax error inside NeedleRegEx), an empty string is returned and ErrorLevel is set to one of the values below instead of 0.
Haystack The string whose content is searched.
NeedleRegEx The pattern to search for, which is a Perl-compatible regular expression (PCRE). The pattern's options (if any) must be included at the beginning of the string followed by a close-parenthesis. For example, the pattern "i)abc.*123" would turn on the case-insensitive option and search for "abc", followed by zero or more occurrences of any character, followed by "123". If there are no options, the ")" is optional; for example, ")abc" is equivalent to "abc".
UnquotedOutputVar

Default mode: OutputVar is the unquoted name of a variable in which to store the part of Haystack that matched the entire pattern. If the pattern is not found (that is, if the function returns 0), this variable and all array elements below are made blank.

If any capturing subpatterns are present inside NeedleRegEx, their matches are stored in an array whose base name is OutputVar. For example, if the variable's name is Match, the substring that matches the first subpattern would be stored in Match1, the second would be stored in Match2, and so on. The exception to this is named subpatterns: they are stored by name instead of number. For example, the substring that matches the named subpattern (?P<Year>\d{4}) would be stored in MatchYear. If a particular subpattern does not match anything (or if the function returns zero), the corresponding variable is made blank.
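As a brief illustration of default mode, the sketch below mixes one named and two unnamed subpatterns. The haystack text and the base name Match are made up for this example.

```autohotkey
; Illustrative only: the date string and the name Match are assumptions.
Haystack := "Released on 2009-05-17"
FoundPos := RegExMatch(Haystack, "(?P<Year>\d{4})-(\d{2})-(\d{2})", Match)
; FoundPos  is 13, the position where the entire pattern matched.
; Match     contains "2009-05-17" (the entire-pattern match).
; MatchYear contains "2009" (named subpattern, stored by name).
; Match2    contains "05" and Match3 contains "17" (unnamed subpatterns).
```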

Position-and-length mode: If a capital P is present in the RegEx's options -- such as "P)abc.*123" -- the length of the entire-pattern match is stored in OutputVar (or 0 if no match). If any capturing subpatterns are present, their positions and lengths are stored in two arrays: OutputVarPos and OutputVarLen. For example, if the variable's base name is Match, the one-based position of the first subpattern's match would be stored in MatchPos1, and its length in MatchLen1 (zero is stored in both if the pattern was not matched or the function returns 0). The exception to this is named subpatterns: they are stored by name instead of number (e.g. MatchPosYear and MatchLenYear).
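A minimal sketch of position-and-length mode follows. It reuses the haystack from the Examples section; the SubStr() call at the end is just one possible way to recover the text from the stored position and length.

```autohotkey
Haystack := "abcXYZ123"
FoundPos := RegExMatch(Haystack, "P)abc(.*)123", Match)
; FoundPos  is 1 (position of the entire match).
; Match     is 9 (length of the entire match, not its text).
; MatchPos1 is 4 and MatchLen1 is 3 (first subpattern).
SubText := SubStr(Haystack, MatchPos1, MatchLen1)  ; Recovers "XYZ".
```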

Within a function, to create an array that is global instead of local, declare the base name of the array (e.g. Match) as a global variable prior to using it. The converse is true for assume-global functions.
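The following sketch shows one way to apply this. The function name ParseVersion and the base name Ver are hypothetical and chosen only for illustration.

```autohotkey
ParseVersion("build 1.48")
; Because Ver was declared global inside the function, the results
; are visible here: Ver is "1.48", Ver1 is "1" and Ver2 is "48".

ParseVersion(Text)  ; Hypothetical helper, not a built-in function.
{
    global Ver  ; Declare the base name before RegExMatch() uses it.
    return RegExMatch(Text, "(\d+)\.(\d+)", Ver)
}
```

Without the global declaration, Ver and its elements would be local to the function and blank outside of it.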

StartingPos

If StartingPos is omitted, it defaults to 1 (the beginning of Haystack). Otherwise, specify 2 to start at the second character, 3 to start at the third, and so on. If StartingPos is beyond the length of Haystack, the search starts at the empty string that lies at the end of Haystack (which typically results in no match).

If StartingPos is less than 1, it is considered to be an offset from the end of Haystack. For example, 0 starts at the last character and -1 starts at the next-to-last character. If StartingPos tries to go beyond the left end of Haystack, all of Haystack is searched.

Regardless of the value of StartingPos, the return value is always relative to the first character of Haystack. For example, the position of "abc" in "123abc789" is always 4.
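The sketch below illustrates these rules with an arbitrary 12-character haystack. Note that every result is counted from the first character of Haystack, not from StartingPos.

```autohotkey
Haystack := "abc123abc456"
Pos1 := RegExMatch(Haystack, "abc")           ; 1: the search starts at the beginning.
Pos2 := RegExMatch(Haystack, "abc", "", 5)    ; 7: the first "abc" lies before position 5.
Pos3 := RegExMatch(Haystack, "\d+", Num, -2)  ; 10: -2 starts at the third-from-last character.
; Num contains "456".
Pos4 := RegExMatch(Haystack, "abc", "", 0)    ; 0: only the last character is searched.
```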

ErrorLevel

ErrorLevel is set to one of the following:

Options (case sensitive)

At the very beginning of a regular expression, specify zero or more of the following options followed by a close-parenthesis. For example, the pattern "im)abc" would search for abc with the case-insensitive and multiline options (the parenthesis can be omitted when there are no options). Although this syntax breaks from tradition, it requires no special delimiters (such as forward-slash), and thus there is no need to escape such delimiters inside the pattern. In addition, performance is improved because the options are easier to parse.

i Case-insensitive matching, which treats the letters A through Z as identical to their lowercase counterparts.
m

Multiline. Views haystack as a collection of individual lines (if it contains newlines) rather than as a single monolithic line. Specifically, it changes the following:

1) Circumflex (^) matches immediately after all internal newlines -- as well as at the start of haystack where it always matches (but it does not match after a newline that ends the string).

2) Dollar-sign ($) matches before any newlines in the string (as well as at the very end where it always matches).

For example, the pattern "m)^abc$" matches the haystack "def`r`nabc" only because the "m" option is present.

The "D" option is ignored when "m" is present.

s DotAll. This causes a period (.) to match all characters including newlines (normally, it does not match newlines). However, when the newline character is at its default of CRLF (`r`n), two dots are required to match it (not one). Regardless of this option, a negative class such as [^a] always matches newlines.
x Ignores whitespace characters in the pattern except when escaped or inside a character class. It also ignores characters between a non-escaped # outside a character class and the next newline character, inclusive. This makes it possible to include comments inside complicated patterns. However, this applies only to data characters; whitespace may never appear within special character sequences such as (?(, which begins a conditional subpattern.
A Forces the pattern to be anchored; that is, it can match only at the start of haystack. Under most conditions, this is equivalent to explicitly anchoring the pattern by means such as "^".
D Forces dollar-sign ($) to match at the very end of haystack, even if haystack's last item is a newline. Without this option, $ instead matches right before the final newline (if there is one). Note: This option is ignored when the "m" option is present.
J Allows duplicate named subpatterns. This can be useful for patterns in which only one of a collection of identically-named subpatterns can match. Note: If more than one instance of a particular name matches something, only the leftmost one is stored. Also, variable names are not case-sensitive.
U Ungreedy. Makes the quantifiers *+?{} consume only those characters absolutely necessary to form a match, leaving the remaining ones available for the next part of the pattern. When the "U" option is not in effect, an individual quantifier can be made non-greedy by following it with a question mark. Conversely, when "U" is in effect, the question mark makes an individual quantifier greedy.
X PCRE_EXTRA. Enables PCRE features that are incompatible with Perl. Currently, the only such feature is that any backslash in a pattern that is followed by a letter that has no special meaning causes the match to fail and ErrorLevel to be set accordingly. This option helps reserve unused backslash sequences for future use. Without this option, a backslash followed by a letter with no special meaning is treated as a literal (e.g. \g and g are both recognized as a literal g). Regardless of this option, non-alphabetic backslash sequences that have no special meaning are always treated as literals (e.g. \/ and / are both recognized as forward-slash).
P Position mode. This causes RegExMatch() to yield the position and length of each match rather than the matching substring. For details, see OutputVar above.
S Study the pattern to try to improve its performance. This is useful when a particular pattern (especially a complex one) will be executed many times. If PCRE finds a way to improve performance, that discovery is stored alongside the pattern in the cache for use by subsequent executions of the same pattern.
`n Switch from the default newline character (`r`n) to a solitary linefeed (`n), which is the standard on UNIX systems.
`r Switch from the default newline character (`r`n) to a solitary carriage return (`r). The option `r`n is also recognized: it means CRLF (which is the default).

Note: Spaces and tabs may optionally be used to separate each option from the next.
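The sketch below exercises three of the options above (m, U and s). The haystacks and variable names are invented for illustration, and the default newline setting (`r`n) is assumed.

```autohotkey
; Multiline: ^ and $ also match next to internal newlines.
Text := "def`r`nabc"
WithM := RegExMatch(Text, "m)^abc$")    ; 6
WithoutM := RegExMatch(Text, "^abc$")   ; 0

; Ungreedy: quantifiers consume as little as possible.
Tags := "<b>one</b><b>two</b>"
RegExMatch(Tags, "<b>(.*)</b>", Greedy)  ; Greedy1 contains "one</b><b>two".
RegExMatch(Tags, "U)<b>(.*)</b>", Lazy)  ; Lazy1 contains "one".

; DotAll: with the default CRLF newline, two dots are needed to span it.
TwoDots := RegExMatch("a`r`nb", "s)a..b")  ; 1
OneDot := RegExMatch("a`r`nb", "s)a.b")    ; 0
```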

Performance

To search for a simple substring inside a larger string, use InStr() because it is faster than RegExMatch().

To improve performance, the 100 most recently used regular expressions are kept cached in memory (in compiled form).

The study option (S) can sometimes improve the performance of a regular expression that is used many times (such as in a loop).
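As a rough sketch of both points: the first two lines show InStr() and RegExMatch() giving the same answer for a plain substring, and the loop shows where the S option might be worth trying. The sample text and variable names are assumptions, and any actual speed difference depends on the pattern and data.

```autohotkey
Haystack := "The quick brown fox"
PosA := InStr(Haystack, "brown")       ; 11, preferred for a plain substring.
PosB := RegExMatch(Haystack, "brown")  ; 11, same result with more overhead.

; The same pattern is applied to every line, so it is studied via "S)".
Lines := "width = 80`nheight = 24`nno value here"
Count := 0
Loop, Parse, Lines, `n
{
    if (RegExMatch(A_LoopField, "S)^(\w+)\s*=\s*(\d+)$", Pair))
        Count += 1
}
; Count is 2 because the third line has no "name = number" form.
```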

Remarks

A subpattern may be given a name such as the word "Year" in (?P<Year>\d{4}). Such names consist of up to 32 alphanumeric characters and underscores. Although named subpatterns are also available by their numbers during the RegEx operation itself (e.g. backreferences), they are stored in the output array only by name (not by number). For example, if "Year" is the first subpattern, OutputVarYear would be set to the matching substring, but OutputVar1 would not be changed at all (it would retain its previous value, if any). However, if an unnamed subpattern occurs after "Year", it would be stored in OutputVar2, not OutputVar1.
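The numbering behavior described above can be seen in a short sketch. The haystack and the initial value given to Match1 are arbitrary.

```autohotkey
Match1 := "untouched"  ; Pre-set to show that RegExMatch() leaves it alone.
RegExMatch("2024-06", "(?P<Year>\d{4})-(\d{2})", Match)
; MatchYear contains "2024" (stored by name only).
; Match1    still contains "untouched".
; Match2    contains "06" (the unnamed subpattern keeps its number, 2).
```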

Most characters like abc123 can be used literally inside a regular expression. However, the characters \.*?+[{|()^$ must be preceded by a backslash to be seen as literal. For example, \. is a literal period and \\ is a literal backslash.
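For instance, the following sketch (with made-up file names) shows how an unescaped period differs from an escaped one, and how a literal backslash is written:

```autohotkey
Pos1 := RegExMatch("file.txt", "\.txt$")   ; 5: \. matches only a literal period.
Pos2 := RegExMatch("fileXtxt", ".txt$")    ; 5: an unescaped period matches any character.
Pos3 := RegExMatch("fileXtxt", "\.txt$")   ; 0: there is no literal period to match.
Pos4 := RegExMatch("C:\Temp", "C:\\Temp")  ; 1: \\ matches a single literal backslash.
```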

Within a regular expression, special characters such as tab and newline can be escaped with either an accent (`) or a backslash (\). For example, `t is the same as \t.
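A small sketch of the two spellings, using an arbitrary tab-delimited haystack:

```autohotkey
Haystack := "name`tvalue"
PosAccent := RegExMatch(Haystack, "`t")     ; 5: the script puts a real tab in the pattern.
PosBackslash := RegExMatch(Haystack, "\t")  ; 5: the RegEx engine interprets \t as a tab.
```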

To learn the basics of regular expressions (or refresh your memory of pattern syntax), see the RegEx Quick Reference.

AutoHotkey's regular expressions are implemented using Perl-compatible Regular Expressions (PCRE) from www.pcre.org.

Related

RegExReplace(), InStr(), IfInString, StringGetPos, SetTitleMatchMode RegEx

Common sources of text data: FileRead, UrlDownloadToFile, Clipboard, GUI Edit controls

Examples

FoundPos := RegExMatch("xxxabc123xyz", "abc.*xyz")  ; Returns 4, which is the position where the match was found.
FoundPos := RegExMatch("abc123123", "123$")  ; Returns 7 because the $ requires the match to be at the end.
FoundPos := RegExMatch("abc123", "i)^ABC")  ; Returns 1 because a match was achieved via the case-insensitive option.
FoundPos := RegExMatch("abcXYZ123", "abc(.*)123", SubPat)  ; Returns 1 and stores "XYZ" in SubPat1.
FoundPos := RegExMatch("abc123abc456", "abc\d+", "", 2)  ; Returns 7 instead of 1 due to StartingPos 2 vs. 1.

; For general RegEx examples, see the RegEx Quick Reference.